Increasing the Action Gap: New Operators for Reinforcement Learning
Authors
Abstract
This paper introduces new optimality-preserving operators on Q-functions. We first describe an operator for tabular representations, the consistent Bellman operator, which incorporates a notion of local policy consistency. We show that this local consistency leads to an increase in the action gap at each state; increasing this gap, we argue, mitigates the undesirable effects of approximation and estimation errors on the induced greedy policies. This operator can also be applied to discretized continuous space and time problems, and we provide empirical results evidencing superior performance in this context. Extending the idea of a locally consistent operator, we then derive sufficient conditions for an operator to preserve optimality, leading to a family of operators which includes our consistent Bellman operator. As corollaries we provide a proof of optimality for Baird's advantage learning algorithm and derive other gap-increasing operators with interesting properties. We conclude with an empirical study on 60 Atari 2600 games illustrating the strong potential of these new operators.

Value-based reinforcement learning is an attractive solution to planning problems in environments with unknown, unstructured dynamics. In its canonical form, value-based reinforcement learning produces successive refinements of an initial value function through repeated application of a convergent operator. In particular, value iteration (Bellman 1957) directly computes the value function through the iterated evaluation of Bellman's equation, either exactly or from samples (e.g. Q-Learning, Watkins 1989). In its simplest form, value iteration begins with an initial value function V_0 and successively computes V_{k+1} := T V_k, where T is the Bellman operator. When the environment dynamics are unknown, V_k is typically replaced by Q_k, the state-action value function, and T is approximated by an empirical Bellman operator. The fixed point of the Bellman operator, Q∗, is the optimal state-action value function or optimal Q-function, from which an optimal policy π∗ can be recovered.

In this paper we argue that the optimal Q-function is inconsistent, in the sense that for any action a which is suboptimal in state x, Bellman's equation for Q∗(x, a) describes the value of a nonstationary policy: upon returning to x, this policy selects π∗(x) rather than a. While preserving global consistency appears impractical, we propose a simple modification to the Bellman operator which provides us with a first-order solution to the inconsistency problem. Accordingly, we call our new operator the consistent Bellman operator. We show that the consistent Bellman operator generally devalues suboptimal actions but preserves the set of optimal policies. As a result, the action gap – the value difference between optimal and second-best actions – increases. Increasing the action gap is advantageous in the presence of approximation or estimation error (Farahmand 2011), and may be crucial for systems operating at a fine time scale such as video games (Togelius et al. 2009; Bellemare et al. 2013), real-time markets (Jiang and Powell 2015), and robotic platforms (Riedmiller et al. 2009; Hoburg and Tedrake 2009; Deisenroth and Rasmussen 2011; Sutton et al. 2011).
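To make the two operators concrete, the following minimal sketch (ours, not the authors' code) applies one sweep of each to a tabular Q-function. The NumPy encoding of the MDP as transition and reward arrays P and R, and all function names, are our illustrative assumptions. The consistent operator coincides with the standard one except on self-transitions, where it backs up Q(x, a) itself rather than max_b Q(x, b), which is what devalues suboptimal actions.

```python
import numpy as np

def bellman_sweep(Q, P, R, gamma):
    """One application of the standard Bellman operator T:
    (T Q)(x, a) = R(x, a) + gamma * sum_x' P(x'|x, a) * max_b Q(x', b).
    P has shape (S, A, S); Q and R have shape (S, A)."""
    V = Q.max(axis=1)          # V(x') = max_b Q(x', b)
    return R + gamma * P @ V   # expectation over next states x'

def consistent_bellman_sweep(Q, P, R, gamma):
    """One application of the consistent Bellman operator, which on
    self-transitions backs up Q(x, a) instead of V(x); equivalently,
    (T_c Q)(x, a) = (T Q)(x, a) - gamma * P(x|x, a) * (V(x) - Q(x, a))."""
    S, _ = Q.shape
    V = Q.max(axis=1)
    TQ = R + gamma * P @ V
    p_stay = P[np.arange(S), :, np.arange(S)]  # P(x'=x | x, a), shape (S, A)
    return TQ - gamma * p_stay * (V[:, None] - Q)

def value_iteration(P, R, gamma, sweep, tol=1e-8):
    """Iterate Q <- sweep(Q) until the Q-values settle."""
    Q = np.zeros(R.shape)
    while True:
        Q_next = sweep(Q, P, R, gamma)
        if np.abs(Q_next - Q).max() < tol:
            return Q_next
        Q = Q_next
```

Run with either sweep, value_iteration should recover the same greedy policy; under the consistent sweep, however, the returned Q-values exhibit a larger action gap at any state with positive self-transition probability.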
In fact, the idea of devaluing suboptimal actions underpins Baird's advantage learning (Baird 1999), designed for continuous-time control, and occurs naturally when considering the discretized solution of continuous time and space MDPs (e.g. Munos and Moore 1998; 2002), whose limit is the Hamilton-Jacobi-Bellman equation (Kushner and Dupuis 2001). Our empirical results on the bicycle domain (Randlov and Alstrom 1998) show a marked increase in performance from using the consistent Bellman operator.

In the second half of this paper we derive novel sufficient conditions for an operator to preserve optimality. The relative weakness of these new conditions reveals that it is possible to deviate significantly from the Bellman operator without sacrificing optimality: an optimality-preserving operator need not be contractive, nor even guarantee convergence of the Q-values for suboptimal actions. While numerous alternatives to the Bellman operator have been put forward (e.g. recently Azar et al. 2011; Bertsekas and Yu 2012), we believe our work to be the first to propose such a major departure from the canonical fixed-point condition required of an optimality-preserving operator. As proof of the richness of this new operator family we describe a few practical instantiations with unique properties.

We use our operators to obtain state-of-the-art empirical results on the Arcade Learning Environment (Bellemare et al. 2013). We consider the Deep Q-Network (DQN) architecture of Mnih et al. (2015), replacing only its learning rule with one of our operators. Remarkably, this one-line change produces agents that significantly outperform the original DQN. Our work, we believe, demonstrates the potential impact of rethinking the core components of value-based reinforcement learning.
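To give a sense of how small this change is, here is a hedged sketch of the modified target computation, following Baird-style advantage learning as described above. The function names, the batch layout, and the value of alpha are our assumptions for illustration; the paper's exact operator may differ in details.

```python
import numpy as np

def dqn_target(rewards, dones, gamma, q_next):
    """Standard DQN target: y = r + gamma * max_b Q~(x', b),
    where Q~ is the target network and q_next = Q~(x', .)."""
    return rewards + gamma * (1.0 - dones) * q_next.max(axis=1)

def advantage_learning_target(rewards, dones, gamma, q_next, q_curr,
                              actions, alpha=0.9):
    """Advantage-learning variant: subtract a fraction alpha of the
    action gap at the current state, computed with the target network:
    y_AL = y - alpha * (max_b Q~(x, b) - Q~(x, a)).
    Relative to the DQN target, this is essentially a one-line change."""
    y = dqn_target(rewards, dones, gamma, q_next)
    rows = np.arange(len(actions))
    gap = q_curr.max(axis=1) - q_curr[rows, actions]  # action gap at x
    return y - alpha * gap
```

Everything else in the DQN training loop – replay memory, target-network updates, the loss against the target y – is left untouched.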
Similar resources
Reinforcement learning based feedback control of tumor growth by limiting maximum chemo-drug dose using fuzzy logic
In this paper, a model-free reinforcement-learning controller is designed to extract a treatment protocol, since the highly nonlinear dynamics of cancer make the design of a model-based controller complex. The Q-learning algorithm is used to develop an optimal controller for cancer chemotherapy drug dosing. In the Q-learning algorithm, each entry of the Q-table is updated using data...
Full text

A Multiagent Reinforcement Learning algorithm to solve the Community Detection Problem
Community detection is a challenging optimization problem that consists of searching for communities within a network, under the assumption that nodes of the same community share properties that enable the detection of new characteristics or functional relationships in the network. Although many algorithms have been developed for community detection, most of them are unsuitable when ...
Full text

Hierarchical Functional Concepts for Knowledge Transfer among Reinforcement Learning Agents
This article introduces the notions of functional space and concept as a way of knowledge representation and abstraction for Reinforcement Learning agents. These definitions are used as a tool for knowledge transfer among agents. The agents are assumed to be heterogeneous: they have different state spaces but share the same dynamics, reward, and action space. In other words, the agents are assumed t...
Full text

Web pages ranking algorithm based on reinforcement learning and user feedback
The main challenge of a search engine is ranking web documents to provide the best response to a user's query. Despite the huge number of results extracted for a user's query, only a small number of the first results are examined by users; placing the relevant results in the first ranks is therefore of great importance. In this paper, a ranking algorithm based on the reinforcement le...
Full text

RRLUFF: Ranking function based on Reinforcement Learning using User Feedback and Web Document Features
The principal aim of a search engine is to provide results sorted according to the user's requirements. To achieve this aim, it employs ranking methods to rank web documents based on their significance and relevance to the user's query. The novelty of this paper is a user-feedback-based ranking algorithm using reinforcement learning. The proposed algorithm is called RRLUFF, in which the rank...
Full text